Redundancy 


Hardware redundancy 

- add extra hardware for detecting or tolerating faults 

Software redundancy 

- add extra software for detecting and possibly tolerating faults 

Information redundancy 

- extra information, i.e. codes 

Time redundancy 

- extra time for performing tasks for fault tolerance 
Error Detection 

ideal check 

- determined solely from specification 

- complete, correct 

- check should be independent from system 

  (otherwise the check fails if the system crashes) 


acceptable check 

- limited by cost 

- reasonableness check, e.g. monitor rate of change 
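
A rate-of-change check of the kind mentioned above can be sketched as follows. This is only an illustration: the function name `rate_of_change_ok` and the parameters `max_rate` and `dt` are assumptions, not part of the original slides.

```python
def rate_of_change_ok(previous: float, current: float,
                      dt: float, max_rate: float) -> bool:
    """Acceptance check: flag a reading whose rate of change is implausible.

    previous, current: two consecutive sensor readings (hypothetical inputs)
    dt:                time between the readings, must be positive
    max_rate:          largest physically plausible change per time unit
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    return abs(current - previous) / dt <= max_rate


# A reading that jumps by 50 units in one time step is rejected
# when at most 5 units per step are considered plausible.
print(rate_of_change_ok(100.0, 101.0, 1.0, 5.0))
print(rate_of_change_ok(100.0, 150.0, 1.0, 5.0))
```

Such a check is cheap, but it is only an acceptable check, not an ideal one: a wrong value that changes slowly still passes.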

diagnostics 

- performed “by system on system components” 

- e.g. power-up diagnostics 
Damage Confinement 

error might propagate and spread 

identify boundaries to the state beyond which no information has been exchanged 

statically 

- e.g. firewall 
forward recovery 

- try to make state error-free 

- need accurate assessment of damage 

- highly application-dependent 
Fault Treatment 

if transient fault: restart system, go to error-free state 

system repair 

- on-line (automatic), no manual intervention 

- dynamic system reconfiguration 

- spare (hot or cold) 
Fault Coverage 

measure of system’s ability to perform: 

- fault detection 

- fault location 

- fault containment 

- (and/or fault recovery) 

C = P(fault recovery | fault existence) 

Note: 

- recovery implies that the system as a whole is operational 

- this does not imply that a “repair” occurred 

- e.g. duplex system with benign fault can recover to continue 
operation on one non-faulty processor 
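
As an illustration of the definition C = P(fault recovery | fault existence), coverage can be estimated as a simple ratio from a fault-injection experiment. The sketch below assumes such an experiment exists; the function name `estimate_coverage` and the example counts are hypothetical.

```python
def estimate_coverage(recovered: int, injected: int) -> float:
    """Estimate C = P(fault recovery | fault existence).

    injected:  number of faults that were present (e.g. injected in a test)
    recovered: number of those faults the system recovered from
    """
    if injected <= 0:
        raise ValueError("at least one fault must have been injected")
    if not 0 <= recovered <= injected:
        raise ValueError("recovered must lie between 0 and injected")
    return recovered / injected


# Hypothetical campaign: 1000 injected faults, 980 recoveries.
print(estimate_coverage(980, 1000))
```

Note that "recovered" here counts runs in which the system as a whole stayed operational, whether or not a repair took place.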
Hardware Redundancy 

Passive (static) 

- uses fault masking to hide occurrence of fault 

- no action from the system is required 

- e.g. voting 

Active (dynamic) 

- uses comparison for detection and/or diagnosis 

- remove faulty hardware from system => reconfiguration 

Hybrid 

- combines both approaches 

- masking until diagnosis is complete 

- expensive, but achieves higher reliability 
Passive Hardware Redundancy 

N-Modular Redundancy (NMR) 

- N independent modules replicate the same function in parallel 

- results are voted on 

- requirements: N >= 3 

TMR (Triple Modular Redundancy) 

- 1 voter: 

  • is single point of failure 

  • could be very simple 

  • but who guards the guard? 

CS449/549 Fault-Tolerant Systems Sequence 3 
Who guards the guards? 

Replicate voters 

- eliminate single point of failure 

- restoring organ: so called since it produces 3 correct outputs even if one input is faulty 
Who guards the guards? 


Multistage TMR with replicate voters 
if inputs are independent, NMR can mask up to (N-1)/2 faults (N odd) 

e.g. 1-bit majority voter (3 AND gates ORed) 

Z = 1 if at least 2 of 3 inputs are 1 
Z = 0 if at least 2 of 3 inputs are 0 
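
The 2-of-3 majority function (three AND terms ORed together) can be sketched in software as below. The function names `majority3` and `max_masked_faults` are illustrative and not taken from the slides; applying the voter bitwise to whole words is an added assumption.

```python
def majority3(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: Z = (A AND B) OR (A AND C) OR (B AND C)."""
    return (a & b) | (a & c) | (b & c)


def max_masked_faults(n: int) -> int:
    """Number of faulty modules an N-modular majority vote can mask."""
    return (n - 1) // 2


# Single-bit case: one faulty input is outvoted.
print(majority3(1, 1, 0))
print(majority3(0, 0, 1))

# The same expression votes bit by bit on whole words.
print(bin(majority3(0b1010, 0b1010, 0b0101)))

print(max_masked_faults(3), max_masked_faults(5))
```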
Active Hardware Redundancy 

Duplicate and Compare 



- can only detect, but NOT diagnose 

i.e. fault detection, no fault-tolerance 

- may order shutdown 

- comparator is single point of failure 

simple implementation: 2 input XOR for single bit compare 
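
A minimal software sketch of duplicate and compare, using XOR as the comparator. The names `mismatch` and `duplex_output` are hypothetical, and returning `None` stands in for whatever shutdown or error signal a real system would raise.

```python
from typing import Optional


def mismatch(out_a: int, out_b: int) -> bool:
    """XOR comparator: any bit that differs makes the result non-zero."""
    return (out_a ^ out_b) != 0


def duplex_output(out_a: int, out_b: int) -> Optional[int]:
    """Forward the output only if both modules agree.

    On disagreement an error is detected, but the comparator cannot tell
    which module is faulty, so no output is produced (None).
    """
    if mismatch(out_a, out_b):
        return None
    return out_a


print(duplex_output(0b1100, 0b1100))
print(duplex_output(0b1100, 0b1000))
```

The sketch also shows the limitation stated above: the comparison gives detection only, and the comparator itself remains a single point of failure.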
Active Hardware Redundancy 

[Figure: Johnson 1989] 

Fig. 3.13 The necessary comparisons in duplication with comparison can be 
implemented in software. Both processors must agree that results match before 
an output is generated. 
Active Hardware Redundancy 

Standby sparing 

- only one module is driving outputs 

- other modules are 

idle => hot spares 
shut down => cold spares 

- error detection => switch to a new module 

- hot spares 

  advantage: no power-up delays 
  disadvantage: power consumption 

- cold spares 

  opposite of hot spares: no power consumption while idle, but power-up delay on switch-over 
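
The switching behaviour of standby sparing can be sketched as follows. This is a simplified model, not a definitive implementation: the class `StandbySparing`, the `check` callback used for error detection, and the example modules are all assumptions made for illustration.

```python
from typing import Callable, List

Module = Callable[[int], int]


class StandbySparing:
    """Exactly one module drives the output; the others wait as spares."""

    def __init__(self, modules: List[Module],
                 check: Callable[[int, int], bool]) -> None:
        self.modules = list(modules)
        self.check = check      # error detection: True if the result is acceptable
        self.active = 0         # index of the module currently driving the output

    def run(self, x: int) -> int:
        while self.active < len(self.modules):
            result = self.modules[self.active](x)
            if self.check(x, result):
                return result
            # Error detected: switch to the next spare and retry.
            self.active += 1
        raise RuntimeError("no fault-free module left")


# Hypothetical modules: the first one is faulty, the second one works.
faulty = lambda x: -1
good = lambda x: 2 * x
system = StandbySparing([faulty, good], check=lambda x, r: r == 2 * x)
print(system.run(3), system.active)
```

In this model the spares behave like hot spares, since switching costs nothing; a cold spare would add a power-up delay at the point where `active` is incremented.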
[Figure: input, n modules, n-to-1 switch, output (Johnson 1989)] 


2007 A.Ji 


Fig. 3.14 In standby sparing, one of n modules is used to provide the system's 
output, and the remaining n - 1 modules serve as spares. Error detection techniques 
identify faulty modules so that a fault-free module is always selected to 
provide the system's output. 



Active Hardware Redundancy 

Pair and Spare 

- duplication combined with compare & spare 

- 2 modules are always on-line 

- 2-of-N switch 

- pairs are often combined 
Fig. 3.15 The pair-and-a-spare technique combines duplication with comparison 
and standby sparing. Two modules are always online and compared, and 
any spare can replace either of the online modules. 
Hybrid Hardware Redundancy 

NMR with spares 

- N active + S spare modules (off-line) 

- voting and comparison 

- replace erroneous module from spare pool 

- maintains N constant 

- uses N-of-(N+S) switch 

example: tolerate 2 faults occurring at 2 different times 

- hybrid solution => 4 modules (TMR core plus 1 spare) 

- passive solution => N = 5, since NMR masks (N-1)/2 faults 
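
The module counts in this example can be reproduced with a small calculation. The sketch assumes a TMR core for the hybrid case, faults that occur one at a time, and a faulty module that is always detected and replaced before the next fault occurs; the function names are hypothetical.

```python
def passive_modules_needed(faults: int) -> int:
    """Plain NMR masks (N-1)/2 faults, so N = 2f + 1 modules are required."""
    return 2 * faults + 1


def hybrid_modules_needed(faults: int) -> int:
    """TMR core with spares, faults arriving one at a time.

    Each of the first f - 1 faults consumes one spare; the last fault
    is simply masked by the 3-module core.
    """
    if faults < 1:
        raise ValueError("expects at least one fault")
    return 3 + (faults - 1)


print(passive_modules_needed(2), hybrid_modules_needed(2))
```

The saving comes from reconfiguration: the hybrid scheme restores full TMR after each fault instead of carrying enough modules to outvote all faults at once.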
[Figure: Johnson 1989] 

Fig. 3.16 N-modular redundancy with spares combines NMR and standby sparing. 
The voted output is used to identify faulty modules, which are then replaced 
with spares. 
Hybrid Hardware Redundancy 

Self-purging NMR (Johnson 1989, Fig. 3.17) 

- all modules are active 

- exclude modules on error detection 

vote & compare 

- N will decrease with faults 
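
A rough sketch of self-purging: every active module votes, and any module that disagrees with the voted system output is excluded from later votes. A plain plurality vote is used here as a simplifying assumption; the function name `self_purging_vote` and the data layout are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple


def self_purging_vote(outputs: List[int],
                      active: Set[int]) -> Tuple[int, Set[int]]:
    """Vote among the active modules and purge those that disagree.

    outputs: output of every module, indexed by module number
    active:  indices of modules still taking part in the vote
    Returns the voted output and the set of modules that remain active.
    """
    if not active:
        raise ValueError("no active modules left")
    votes = Counter(outputs[i] for i in sorted(active))
    result, _ = votes.most_common(1)[0]
    survivors = {i for i in active if outputs[i] == result}
    return result, survivors


active = {0, 1, 2, 3, 4}
result, active = self_purging_vote([7, 7, 9, 7, 7], active)   # module 2 is purged
print(result, sorted(active))
result, active = self_purging_vote([7, 7, 9, 3, 7], active)   # module 3 is purged
print(result, sorted(active))
```

The shrinking `active` set corresponds to the last point above: N decreases as faults accumulate.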
Fig. 3.17 Self-purging redundancy uses the system output to remove modules 
whose output disagrees with the system output. (From [Losq, 1976] © 1976 IEEE) 
Hybrid Hardware Redundancy 

Triple-Duplex (Johnson 1989 Fig. 3.26, page 80) 

- redundant self checking 

- each node is really 2 modules + comparator 

- self-disable in event of error 

- “simulate” benign behavior 

- triple-triplex used in Boeing 777 primary flight computer 

each triplex node employs 3 dissimilar processors 